Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Such audio-visual events not only play a critical role in human perception in social life, but are also involved in diverse human-computer interaction scenarios, e.g., multi-modal robot dialogue systems or in-vehicle AI navigation systems. As shown in Fig. 1, when driving an autonomous vehicle, we can easily interact with the intelligent driver assistance system, which is privately designated by the driver. In many cases, however, noise coming from the rear acts as an interference signal that disrupts this human-computer interaction and frequently prevents the intelligent assistant from accurately extracting the driver's instructions and responding accordingly. The current limitations in audio-visual interaction can therefore be summarized as three challenges that call for more effective solutions: 1) identifying the voice of the target speaker in the mixed audio signal without being disturbed by interruptions from other speakers; 2) enhancing the target speaker's speech while ignoring background noise, so that the target speaker's command can be extracted; 3) accurately recognizing the target's speech when a new candidate appears who has not pre-registered any voice information.